Creating a Persian-English Comparable Corpus
Abstract
Multilingual corpora are valuable resources for cross-language information retrieval and are available for many language pairs. However, the Persian language lacks rich multilingual resources because of some of its special features and the difficulty of constructing such corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. We use the similarity of document topics and publication dates to align the documents in these sets. We tried several alternatives for constructing the comparable corpora and assessed their quality using different criteria. Evaluation results show that the aligned documents are of high quality, and using the Persian-English comparable corpus to extract translation knowledge seems promising.
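The abstract says only that documents are aligned by topic similarity and publication date; it does not give the scoring function or thresholds. The following is a minimal sketch of that general idea, not the authors' method. It assumes each document's topic terms have already been mapped into one language (for example through a bilingual dictionary), and it uses Jaccard overlap as a stand-in similarity measure. The names `Doc`, `jaccard`, and `align`, and the values of `max_days` and `min_score`, are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    doc_id: str
    published: date
    terms: frozenset  # topic terms, assumed already mapped into one language


def jaccard(a, b):
    """Overlap between two term sets, in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def align(source_docs, target_docs, max_days=1, min_score=0.3):
    """Pair each source document with its best-scoring target document.

    A pair is considered only if the publication dates differ by at most
    max_days and the term overlap reaches min_score. Both values are
    illustrative assumptions, not taken from the paper.
    """
    pairs = []
    for src in source_docs:
        best, best_score = None, 0.0
        for tgt in target_docs:
            if abs((src.published - tgt.published).days) > max_days:
                continue  # outside the date window
            score = jaccard(src.terms, tgt.terms)
            if score >= min_score and score > best_score:
                best, best_score = tgt, score
        if best is not None:
            pairs.append((src.doc_id, best.doc_id, best_score))
    return pairs


# Toy data: invented documents, used only to exercise the function.
english = [
    Doc("en-1", date(2004, 3, 1), frozenset({"election", "parliament", "vote"})),
    Doc("en-2", date(2004, 3, 1), frozenset({"earthquake", "rescue"})),
]
persian = [
    Doc("fa-1", date(2004, 3, 2), frozenset({"election", "vote", "candidate"})),
    Doc("fa-2", date(2004, 3, 10), frozenset({"earthquake", "rescue"})),
]
print(align(persian, english))
```

In this toy run, `fa-1` pairs with `en-1` because the two share two of four distinct terms and were published one day apart, while `fa-2` is left unaligned because no English document falls inside its date window. Widening `max_days` or lowering `min_score` trades alignment precision for coverage.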
Similar resources
Creation of comparable corpora for English-Urdu, Arabic, Persian
Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, for under-resourced languages or some specific domains, parallel corpora are not readily available. This leads to under-performing machine translation systems in those sparse data settings. To overcome the low availability of parallel resources, the machine translation community has rec...
Extracting Persian-English Parallel Sentences from Document Level Aligned Comparable Corpus using Bi-Directional Translation
Bilingual parallel corpora are very important in various fields of natural language processing (NLP). The quality of a Statistical Machine Translation (SMT) system depends strongly on the amount of training data. For low-resource language pairs such as Persian-English, there are not enough parallel sentences to build an accurate SMT system. This paper describes a new approach to use the Wiki...
Using English as Pivot to Extract Persian-Italian Parallel Sentences from Non-Parallel Corpora
Ebrahim Ansari et al. 2017. Using English as Pivot to Extract Persian-Italian Parallel Sentences from Non-Parallel Corpora. In "Applications of Comparable Corpora", edited book, Berlin Linguistic Press (ed.). The effectiveness of a statistical machine translation (SMT) system depends heavily on the amount of parallel corpus used in the training phase. For low-resource l...
Creating a Feasible Corpus for Persian POS Tagging
This paper describes the creation of a test collection for Persian Part of Speech (POS) tagging experiments. This collection was created by modifying a manually POS-tagged Persian corpus with over two million tagged words. The original collection had a tag set of 550 tags, which is more than any machine learning algorithm can handle. The number of tags for these experiments was reduc...
Creation of a Doctor-Patient Dialogue Corpus Using Standardized Patients
In this paper we describe the development of a doctor-patient dialogue corpus to support a speech-to-speech machine translation effort for English-Persian medical dialogues. The corpus was developed by recording and transcribing English-to-English dialogues between medical students and standardized patients (actors who have been trained to portray illness or injury victims), and then translated...